Then and now…

How it was going

How it is going

What is this thing we call “bioinformatics”?

Diagram: Computers and Science, with three branches:

  • Statistics: hypothesis testing (p-values, effect sizes), power analysis, linear modeling (hierarchical modeling), visualizations, Bayesian statistics
  • Bioinformatics: algorithms (implementation), formalizations, standards, databases (maintenance)
  • Computational biology: data science (data stewardship), functional genomics (gene set enrichment), using tools, sequence analysis (phylogenies)

Statistics

What is a p-value?

\(H_0\): The null hypothesis, no effect

\(H_1\): The alternative hypothesis, there is an effect

We run a test, we get a p-value, say \(0.03\). It is a probability.

Probability of what, exactly?

  1. Probability that \(H_0\) is true (probability that there is no difference), given the data

  2. Probability that \(H_1\) is true (probability that there is a difference), given the data

  3. Probability that the data is random

  4. Probability that the observations are due to random chance

  5. Probability of getting the same data by random chance

  • Probability of observing an effect at least as extreme as the one observed, given that \(H_0\) is true
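To make the last definition concrete, here is a minimal sketch of a permutation test. It estimates how often a difference at least as large as the observed one appears when group labels are shuffled, that is, when \(H_0\) holds by construction. The measurements and the function name `permutation_p_value` are made up for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Estimates P(|difference| >= |observed difference|) under H0, where H0
    says that the group labels are exchangeable (no effect).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random: this simulates H0
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements, for illustration only.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
treated = [5.2, 5.9, 5.5, 6.1, 5.0, 5.8, 6.3, 5.4]
p = permutation_p_value(control, treated)
print(f"p = {p:.4f}")
```

Note that nothing in this calculation says how probable \(H_0\) or \(H_1\) is. The p-value only describes the data under the assumption that \(H_0\) is true.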

Our intuition is Bayesian, not frequentist

Frequentist statistics:

  1. Probability is defined as the long-run frequency of events.
  2. Parameters (like the “true value”) are fixed but unknown quantities.
  3. Asking about the probability of a hypothesis does not make sense.

Bayesian statistics:

  1. Probability represents a degree of belief or certainty about an event.
  2. Parameters are treated as random variables with their own probability distributions.
  3. Asking about the probability of a hypothesis is the main goal.

Why is that important?

P-values are part of scientific language

  • Always use effect sizes
  • Never rely on p-values alone
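A small sketch of why the p-value alone is not enough: with a large enough sample, a negligible shift yields a very small p-value. The simulated data, the shift of 0.08 standard deviations, and the helper names are assumptions for illustration. The p-value uses a normal approximation, which is reasonable only for large samples.

```python
import random
from statistics import NormalDist, fmean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (fmean(b) - fmean(a)) / pooled_var ** 0.5

def large_sample_p_value(a, b):
    """Two-sided p-value from a normal approximation (large samples only)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (fmean(b) - fmean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(3)
n = 20_000  # hypothetical, very large sample per group
group_a = [rng.gauss(0.00, 1.0) for _ in range(n)]
group_b = [rng.gauss(0.08, 1.0) for _ in range(n)]  # tiny true shift

d = cohens_d(group_a, group_b)
p = large_sample_p_value(group_a, group_b)
print(f"Cohen's d = {d:.3f}, p = {p:.2g}")
```

The p-value says that the shift is unlikely to be zero. The effect size says that the shift is very small. Both pieces of information are needed to judge the result.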

Know their limits:

  • they control only type I errors (false positives)
  • they do not control type II errors (false negatives)
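The two limits above can be checked by simulation. The sketch below assumes a deliberately simple setting: normally distributed data with a known standard deviation of 1, tested with a two-sided z-test against a true mean of 0. Sample sizes and the effect of 0.5 are made up for illustration.

```python
import random
from statistics import NormalDist

def z_test_p_value(sample, sd=1.0):
    """Two-sided p-value for H0: true mean is 0, with known standard deviation."""
    z = (sum(sample) / len(sample)) / (sd / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_mean, n, alpha=0.05, n_experiments=5_000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejected += 1
    return rejected / n_experiments

# H0 is true: every rejection is a type I error (false positive).
false_positive_rate = rejection_rate(true_mean=0.0, n=10)

# H0 is false: every non-rejection is a type II error (false negative).
power_small = rejection_rate(true_mean=0.5, n=10)
power_large = rejection_rate(true_mean=0.5, n=50)

print(f"type I error rate:       {false_positive_rate:.3f}")
print(f"type II error rate n=10: {1 - power_small:.3f}")
print(f"type II error rate n=50: {1 - power_large:.3f}")
```

The threshold fixes the type I error rate near 0.05 regardless of sample size. The type II error rate is not fixed by the threshold; it depends on the effect size and the sample size, which is what power analysis is for.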

How Venn diagrams can fool scientists

In a COVID-19 study, COVID-19 patients are compared with non-COVID-19 patients in each of two groups of people, G1 and G2.

We wanted to know whether the influence of COVID-19 is different in these two groups.

The results are artifacts!

Groups G1 and G2 were randomly drawn from the same population. They were not different at all.

The problem is that we are comparing significance with non-significance.

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant

(Andrew Gelman and Hal Stern)

If a gene is significant in one comparison, and not significant in another, that does not mean that there is a difference between the two groups.

It may simply mean that we failed to detect the effect in one of the comparisons, and that is actually quite likely to happen!

Therefore:

Don’t say “there is no difference”. Say “we did not detect a difference”.
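How likely is “quite likely”? The sketch below simulates two groups, G1 and G2, drawn from the same population with the same true effect, and tests each group separately. It then also tests the difference between the two effects directly. All numbers (effect of 0.6, 20 subjects per arm, known standard deviation of 1, z-tests) are assumptions chosen for illustration, not values from the study described above.

```python
import random
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_per_arm=20, effect=0.6, alpha=0.05, n_sim=4_000, seed=2):
    """Return (fraction of discordant results, fraction of significant differences)."""
    rng = random.Random(seed)
    se_effect = (2 / n_per_arm) ** 0.5  # SE of a difference of two means, sd = 1
    discordant = 0
    difference_significant = 0
    for _ in range(n_sim):
        estimates = []
        for _group in ("G1", "G2"):  # identical true effect in both groups
            patients = [rng.gauss(effect, 1.0) for _ in range(n_per_arm)]
            controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
            estimates.append(sum(patients) / n_per_arm - sum(controls) / n_per_arm)
        significant = [two_sided_p(e / se_effect) < alpha for e in estimates]
        if significant[0] != significant[1]:
            discordant += 1  # "significant in one group only"
        # The correct question: do the two effects differ from each other?
        z_diff = (estimates[0] - estimates[1]) / (2 ** 0.5 * se_effect)
        if two_sided_p(z_diff) < alpha:
            difference_significant += 1
    return discordant / n_sim, difference_significant / n_sim

discordant_rate, difference_rate = simulate()
print(f"significant in exactly one group:   {discordant_rate:.2f}")
print(f"difference between groups detected: {difference_rate:.2f}")
```

With roughly 50% power per comparison, about half of the simulated experiments end up “significant in one group only”, although the groups do not differ. Testing the difference directly keeps the false positive rate near the chosen threshold.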

Tale of two papers

Lessons learned

  • A lot depends on how you analyze your data
  • This in turn depends on the questions you ask
  • The average “Methods” section is not sufficient for reproducible science!

Attempt to replicate 53 high-impact cancer biology papers:

“Second, none of the 193 experiments were described in sufficient detail in the original paper to enable us to design protocols to repeat the experiments, so we had to seek clarifications from the original authors.” (Errington et al., 2021)

Excel and gene names

Thank you

You can find a longer presentation along with its source code at https://github.com/bihealth/howtotalk

Parts of this have been expanded into a longer text, which can be found at https://bihealth.github.io/howtotalk-book/

A five-day R crash course book is available at https://bihealth.github.io/RCrashcourse-book/

The statistical testing roulette

Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

The error is widespread

Nieuwenhuis et al. found that half of the scientists who could have committed this error did in fact commit it.

Will “AI” change the field?

  • New deep learning methods are useful, but hard to use
  • Some of them are truly revolutionizing the field
  • There is still a place for simpler ML algorithms
  • Ready-to-use LLMs (ChatGPT & Co.) have their uses, but also their limitations